Poster: christ_chan64 Date: Oct 23, 2018 5:14am
Forum: texts Subject: Re: Recurrent typos in Lord of the Rings

Not an Archive employee, but the djvu.txt file is a derivative file.

Basically, when a PDF is uploaded, page images are extracted from it and read by optical character recognition (OCR) software called ABBYY. The readout is saved as a file named filename_abbyy.gz. Every other format is ultimately derived from this OCR readout, including the _djvu.txt file.

Due to the imperfect nature of OCR technology, the readout is rarely 100% accurate and errors will creep in.

I doubt Archive employees have time to correct errors in OCR-ed text. If it is your item, I would suggest downloading the text file, correcting the errors, and re-uploading it with the same filename. (Not sure whether the Archive software would overwrite the changes, though.)

Alternatively, you could make the corrections yourself and upload the corrected text as a separate item.
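
If the same misreadings keep recurring, correcting the downloaded text file could be partly automated. Here is a rough Python sketch of that idea, not anything the Archive provides. The function name and the two entries in the corrections table are made up for illustration; you would fill the table in with the errors you actually see in your file.

```python
import re

# Hypothetical examples of recurring OCR misreadings. Replace these
# with the errors actually found in your own text file.
CORRECTIONS = {
    "Erodo": "Frodo",
    "tbe": "the",
}


def fix_recurrent_typos(text, corrections=CORRECTIONS):
    """Replace each listed misreading with its correction.

    Only whole words are matched, so a misreading that happens to
    appear inside a longer word is left alone.
    """
    if not corrections:
        return text
    # Build one pattern that matches any of the listed misreadings.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in corrections) + r")\b"
    )
    return pattern.sub(lambda match: corrections[match.group(1)], text)
```

To use it, read the downloaded text file into a string, pass the string to fix_recurrent_typos, and write the result to a new file. It is worth proofreading the result afterwards, since a blanket replacement can also change a word that was actually correct.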